The dataset contains every sustained or not yet adjudicated violation citation from every full or special program inspection conducted up to three years prior to the most recent inspection for restaurants and college cafeterias in an active status on the RECORD DATE (date of the data pull). When an inspection results in more than one violation, values for associated fields are repeated for each additional violation record. Establishments are uniquely identified by their CAMIS (record ID) number. Only restaurants in an active status are included in the dataset.
Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations. Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.
So how are these grades actually determined? Per DOHMH, restaurants receive an initial inspection during which they are assigned points for any violations found. If those points add up to less than 14, the restaurant receives an A and will be reinspected in 12 months. If the restaurant receives 14 or more points, the restaurant will be reinspected 7 or more days later. The score from the re-inspection determines the grade to be posted. If the restaurant scores more than 14 points again, they have the option of posting their B or C grade card or a “GRADE PENDING” card, until their violations are adjudicated through an Office of Administrative Trials and Hearings (OATH) hearing. This process is summarized in the graphic below.
Inspection Process
In other words, each restaurant has three chances per cycle to receive an A (Initial inspection, re-inspection, final adjudication). Note that the initial score also determines how quickly the inspection cycle begins again for each restaurant; restaurants with a higher number of violations are reinspected sooner than those with fewer violations. This graphic also does not describe conditions under which restaurants face closure by the DOHMH.
This dataset and the information on the Health Department’s Restaurant Grading website come from the same data source.
The data about inspection result can be found at https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
The Health Department’s Restaurant Grading website is here: http://www1.nyc.gov/site/doh/services/restaurant-grades.page
The goal of this project is to analyze and interpret the results across multiple variables and find interesting observations, co-relations or patterns
Load the data set
inspection <- read.csv('..\\data\\DOHMH_New_York_City_Restaurant_Inspection_Results.csv')
knitr::kable(head(inspection))
| CAMIS | DBA | BORO | BUILDING | STREET | ZIPCODE | PHONE | CUISINE.DESCRIPTION | INSPECTION.DATE | ACTION | VIOLATION.CODE | VIOLATION.DESCRIPTION | CRITICAL.FLAG | SCORE | GRADE | GRADE.DATE | RECORD.DATE | INSPECTION.TYPE | Latitude | Longitude | Community.Board | Council.District | Census.Tract | BIN | BBL | NTA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 41651801 | THE LIBERTY | Manhattan | 29 | WEST 35 STREET | 10001 | 2129674000 | Irish | 06/17/2019 | Violations were cited in the following area(s). | 02G | Cold food item held above 41º F (smoked fish and reduced oxygen packaged foods above 38 ºF) except during necessary preparation. | Y | 12 | A | 06/17/2019 | 12/03/2020 | Cycle Inspection / Re-inspection | 40.74960 | -73.98527 | 105 | 4 | 8400 | 1015891 | 1008370023 | MN17 |
| 50074624 | CAFE 47 | Manhattan | 47 | WEST 66 STREET | 10023 | 6467408369 | American | 10/31/2019 | Violations were cited in the following area(s). | 04L | Evidence of mice or live mice present in facility’s food and/or non-food areas. | Y | 12 | A | 10/31/2019 | 12/03/2020 | Cycle Inspection / Initial Inspection | 40.77299 | -73.98008 | 107 | 6 | 15300 | 1081016 | 1011190021 | MN14 |
| 50082531 | KENNEDY FRIED CHICKEN | Bronx | 1220 | OGDEN AVENUE | 10452 | 6469059373 | Chicken | 12/06/2018 | Violations were cited in the following area(s). | 10D | Mechanical or natural ventilation system not provided, improperly installed, in disrepair and/or fails to prevent excessive build-up of grease, heat, steam condensation vapors, odors, smoke, and fumes. | N | 11 | A | 12/06/2018 | 12/03/2020 | Pre-permit (Operational) / Re-inspection | 40.83825 | -73.92633 | 204 | 16 | 19900 | 2003298 | 2025160022 | BX26 |
| 50071207 | VIOLETTE’S CELLAR | Staten Island | 2271 | HYLAN BOULEVARD | 10306 | 7186505050 | American | 08/29/2018 | Violations were cited in the following area(s). | 04N | Filth flies or food/refuse/sewage-associated (FRSA) flies present in facility’s food and/or non-food areas. Filth flies include house flies, little house flies, blow flies, bottle flies and flesh flies. Food/refuse/sewage-associated flies include fruit flies, drain flies and Phorid flies. | Y | 38 | 12/03/2020 | Cycle Inspection / Initial Inspection | 40.57537 | -74.10483 | 502 | 50 | 12200 | 5107538 | 5036150005 | SI45 | ||
| 50090691 | SALIM KENNEDY FRIED CHICKEN | Queens | 16614 | HILLSIDE AVE | 11432 | 7183744616 | Bangladeshi | 02/20/2020 | Violations were cited in the following area(s). | 08A | Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist. | N | 20 | 12/03/2020 | Cycle Inspection / Initial Inspection | 40.70957 | -73.79619 | 412 | 27 | 46000 | 4210082 | 4098180062 | QN61 | ||
| 40394176 | MOONSTRUCK EAST | Manhattan | 449 | THIRD AVENUE | 10016 | 2122131100 | American | 09/20/2019 | Violations were cited in the following area(s). | 06A | Personal cleanliness inadequate. Outer garment soiled with possible contaminant. Effective hair restraint not worn in an area where food is prepared. | Y | 43 | 12/03/2020 | Cycle Inspection / Initial Inspection | 40.74372 | -73.97957 | 106 | 2 | 7000 | 1082148 | 1009110061 | MN20 |
Columns
colnames(inspection)
## [1] "CAMIS" "DBA" "BORO"
## [4] "BUILDING" "STREET" "ZIPCODE"
## [7] "PHONE" "CUISINE.DESCRIPTION" "INSPECTION.DATE"
## [10] "ACTION" "VIOLATION.CODE" "VIOLATION.DESCRIPTION"
## [13] "CRITICAL.FLAG" "SCORE" "GRADE"
## [16] "GRADE.DATE" "RECORD.DATE" "INSPECTION.TYPE"
## [19] "Latitude" "Longitude" "Community.Board"
## [22] "Council.District" "Census.Tract" "BIN"
## [25] "BBL" "NTA"
Number of rows
orig_rows <- nrow(inspection)
orig_rows
## [1] 399979
Let us remove columns that are not of interest for our analysis and rename some of them that we will keep
drop <- c('BUILDING','STREET','PHONE','RECORD.DATE','GRADE.DATE','Community.Board','Council.District',
'Census.Tract','BIN','BBL','NTA')
inspection[,drop] <- list(NULL)
inspection <- dplyr::rename(inspection,
ESTABLISHMENT = DBA,
BOROUGH = BORO,
CUISINE = CUISINE.DESCRIPTION,
LATITUDE = Latitude,
LONGITUDE = Longitude)
cat('Final columns:\n', paste0('"',colnames(inspection),'"','\n'))
## Final columns:
## "CAMIS"
## "ESTABLISHMENT"
## "BOROUGH"
## "ZIPCODE"
## "CUISINE"
## "INSPECTION.DATE"
## "ACTION"
## "VIOLATION.CODE"
## "VIOLATION.DESCRIPTION"
## "CRITICAL.FLAG"
## "SCORE"
## "GRADE"
## "INSPECTION.TYPE"
## "LATITUDE"
## "LONGITUDE"
Renaming two cuisines for ease of plotting
inspection[inspection$CUISINE ==
'Latin (Cuban, Dominican, Puerto Rican, South & Central American)',
"CUISINE"] = "Latin"
inspection[inspection$CUISINE == "Café/Coffee/Tea",
"CUISINE"] = "Cafe"
paste('Columns renamed')
## [1] "Columns renamed"
Check for unique years in Inspection Date
u_year <- unique(year(mdy(inspection$INSPECTION.DATE)))
paste(u_year)
## [1] "2019" "2018" "2020" "2016" "2017" "1900" "2012" "2014" "2015" "2013"
error_nrows <- length(which(year(mdy(inspection$INSPECTION.DATE)) == "1900"))
paste('Number of rows containing inspection year as 1900:', error_nrows)
## [1] "Number of rows containing inspection year as 1900: 2955"
“1/1/1900” represents that the establishment has not yet been inspected and so we will remove this data from our analysis
inspection <- inspection[year(mdy(inspection$INSPECTION.DATE)) != "1900",]
paste('Number of rows removed:', (orig_rows - nrow(inspection)))
## [1] "Number of rows removed: 2955"
We will remove date from 2012, 2013, 2014, 2015 as these are only around 200 records
paste("Number of rows for year 2012-2015:", nrow(inspection[year(mdy(inspection$INSPECTION.DATE)) %in% c("2012","2013","2014","2015"),]))
## [1] "Number of rows for year 2012-2015: 207"
inspection <- inspection[year(mdy(inspection$INSPECTION.DATE)) %in% c("2016","2017","2018","2019","2020"),]
We will add two columns, one will store year part of inspection date and other will store month. These will come handy during our plots
inspection$YEAR = year(mdy(inspection$INSPECTION.DATE))
inspection$YEAR = as.character(inspection$YEAR)
inspection$MONTH = month(mdy(inspection$INSPECTION.DATE), label = TRUE)
cat('Columns:\n', paste0('"',colnames(inspection),'"','\n'))
## Columns:
## "CAMIS"
## "ESTABLISHMENT"
## "BOROUGH"
## "ZIPCODE"
## "CUISINE"
## "INSPECTION.DATE"
## "ACTION"
## "VIOLATION.CODE"
## "VIOLATION.DESCRIPTION"
## "CRITICAL.FLAG"
## "SCORE"
## "GRADE"
## "INSPECTION.TYPE"
## "LATITUDE"
## "LONGITUDE"
## "YEAR"
## "MONTH"
How many unique establishments are there in the data set?
length(unique(inspection$CAMIS))
## [1] 25896
In which borough the inspections are held
unique(inspection$BOROUGH)
## [1] "Manhattan" "Bronx" "Staten Island" "Queens"
## [5] "Brooklyn" "0"
There are 229 rows with borough as 0. We will remove these as part of our data clean up
inspection <- inspection[inspection$BOROUGH != 0,]
paste("Rows removed for Borough = 0")
## [1] "Rows removed for Borough = 0"
inspection %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(n=dplyr::n(), BOROUGH = unique(BOROUGH)) %>%
dplyr::ungroup() %>%
plot_ly(x = ~BOROUGH) %>%
add_histogram() %>%
layout(title = "Number of Restuarants in each borough from Jan 2016 to Nov 2020",
xaxis = list(title="Borough"),
yaxis = list(title="Total number of restaurants"),
bargap = 0.5)
Interpretations
cuisine <- inspection %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(CUISINE = unique(CUISINE)) %>%
dplyr::ungroup()
cuisine <- as.data.frame(dplyr::count(cuisine, cuisine$CUISINE))
cuisine <- dplyr::rename(cuisine,
"CUISINE" = "cuisine$CUISINE",
"COUNT" = "n")
cuisine <- cuisine[order(cuisine$COUNT, decreasing = TRUE),]
cuisine[1:10,] %>%
plot_ly(x = ~CUISINE, y = ~COUNT, type = "bar") %>%
layout(title = "Top 10 restaurants by numbers in cuisine category from Jan 2016 to Nov 2020 ",
xaxis = list(categoryorder = "array",
categoryarray = ~CUISINE),
yaxis = list(title="Number of restaurants"),
bargap = 0.5)
Interpretations
inspection %>%
dplyr::group_by(CAMIS, YEAR) %>%
dplyr::summarise(n=as.character(length(unique(INSPECTION.DATE))), YEAR = unique(YEAR)) %>%
dplyr::ungroup() %>%
plot_ly(x = ~YEAR, y = ~n, histfunc = "sum", type = "histogram") %>%
layout(title = "Number of inspections/year",
xaxis = list(title="Year"),
yaxis = list(title="Inspections", type = "linear"),
bargap = 0.5)
#p <- plot_ly(x = inspection$YEAR) %>%
# add_histogram() %>%
# layout(title = "Number of inspections/year",
# xaxis = list(title="Year"),
# yaxis = list(title="Inspections"),
# bargap = 0.4)
#p
Interpretations
inspection %>%
dplyr::group_by(CAMIS, MONTH) %>%
dplyr::summarise(n=as.character(length(unique(INSPECTION.DATE))), MONTH = unique(MONTH)) %>%
dplyr::ungroup() %>%
plot_ly(x = ~MONTH, y = ~n, histfunc = "sum", type = "histogram") %>%
layout(title = "Number of inspections/month",
xaxis = list(title="Month"),
yaxis = list(title="Inspections", type = "linear"),
bargap = 0.5)
#p <- plot_ly(x = inspection$MONTH) %>%
# add_histogram() %>%
# layout(title = "Number of inspections/month (Jan 2016 - Nov 2020)",
# xaxis = list(title="Month"),
# yaxis = list(title="Inspections"),
# bargap = 0.4)
#p
Interpretations
tmp <- inspection %>%
dplyr::group_by(CAMIS, YEAR, MONTH) %>%
dplyr::summarise(n=as.character(length(unique(INSPECTION.DATE))), MONTH = unique(MONTH), YEAR = unique(YEAR)) %>%
dplyr::ungroup()
p <- plot_ly(x = ~MONTH, y = ~n, data = filter(tmp, YEAR == 2020),
histfunc = "sum", type = "histogram", name = "2020") %>%
add_trace(x = ~MONTH, data = filter(tmp, YEAR == 2019), name = "2019") %>%
add_trace(x = ~MONTH, data = filter(tmp, YEAR == 2018), name = "2018") %>%
add_trace(x = ~MONTH, data = filter(tmp, YEAR == 2017), name = "2017") %>%
add_trace(x = ~MONTH, data = filter(tmp, YEAR == 2016), name = "2016") %>%
layout(title = "Number of inspections/month grouped by year",
xaxis = list(title="Month"),
yaxis = list(title="Inspections", type = "linear"),
legend = list(title=list(text='<b> Year </b>')))
p
#p <- plot_ly(x = ~MONTH, data = filter(inspection, YEAR == 2020),
# type = "histogram", name = "2020") %>%
# add_trace(x = ~MONTH, data = filter(inspection, YEAR == 2019), name = "2019") %>%
# add_trace(x = ~MONTH, data = filter(inspection, YEAR == 2018), name = "2018") %>%
# add_trace(x = ~MONTH, data = filter(inspection, YEAR == 2017), name = "2017") %>%
# add_trace(x = ~MONTH, data = filter(inspection, YEAR == 2016), name = "2016") %>%
# layout(title = "Number of inspections/month grouped by year",
# xaxis = list(title="Month"),
# yaxis = list(title="Inspections"),
# legend = list(title=list(text='<b> Year </b>')))
#p
Interpretations
inspection %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(n=as.character(length(unique(INSPECTION.DATE))),
BOROUGH = unique(BOROUGH)) %>%
dplyr::ungroup() %>%
plot_ly(x = ~BOROUGH, y = ~n, type = "histogram", histfunc = "sum") %>%
layout(title = "Number of inspections in each borough (Jan 2016 - Nov 2020)",
xaxis = list(title="Borough"),
yaxis = list(title="Inspections", type = "linear"),
bargap = 0.5)
Interpretations
Data Modification
inspnew <- inspection %>%
dplyr::mutate(Date = mdy(INSPECTION.DATE)) %>%
dplyr::group_by(CAMIS, YEAR) %>%
dplyr::filter(Date == max(Date)) %>%
dplyr::ungroup()
invisible(dplyr::ungroup(inspnew))
inspnew <- subset(inspnew, select = -c(Date))
paste("This dataset has", nrow(inspnew), "rows")
## [1] "This dataset has 189641 rows"
Split data by year
df2019 <- inspnew[inspnew$YEAR == 2019,]
paste("2019 dataset has", nrow(df2019), "rows")
## [1] "2019 dataset has 61063 rows"
tmp <- df2019 %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(SCORE = unique(SCORE)) %>%
dplyr::ungroup()
paste("Max value of score:", max(tmp$SCORE, na.rm = TRUE))
## [1] "Max value of score: 164"
paste("Min value of score:", min(tmp$SCORE, na.rm = TRUE))
## [1] "Min value of score: -1"
paste("Mean value of score:", mean(tmp$SCORE, na.rm = TRUE))
## [1] "Mean value of score: 11.2379941739492"
paste("Standard Deviation value of score:", sd(tmp$SCORE, na.rm = TRUE))
## [1] "Standard Deviation value of score: 7.7372793777009"
df2019 %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(SCORE = unique(SCORE)) %>%
dplyr::ungroup() %>%
plot_ly(x = ~SCORE, type = "histogram") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider()
Interpretations
tmp <- df2019 %>%
dplyr::group_by(CAMIS, BOROUGH) %>%
dplyr::summarise(SCORE = unique(SCORE), BOROUGH = unique(BOROUGH)) %>%
dplyr::ungroup()
brklyn <- tmp[tmp$BOROUGH == "Brooklyn",]
bronx <- tmp[tmp$BOROUGH == "Bronx",]
mnhttn <- tmp[tmp$BOROUGH == "Manhattan",]
queens <- tmp[tmp$BOROUGH == "Queens",]
staten <- tmp[tmp$BOROUGH == "Staten Island",]
fit_brklyn <- density(brklyn$SCORE, na.rm = TRUE)
fit_bronx <- density(bronx$SCORE, na.rm = TRUE)
fit_mnhttn <- density(mnhttn$SCORE, na.rm = TRUE)
fit_queens <- density(queens$SCORE, na.rm = TRUE)
fit_staten <- density(staten$SCORE, na.rm = TRUE)
plot_ly(x = ~fit_brklyn$x, y = ~fit_brklyn$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Brooklyn") %>%
add_trace(x = ~fit_bronx$x, y = ~fit_bronx$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Bronx") %>%
add_trace(x = ~fit_mnhttn$x, y = ~fit_mnhttn$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Manhattan") %>%
add_trace(x = ~fit_queens$x, y = ~fit_queens$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Queens") %>%
add_trace(x = ~fit_staten$x, y = ~fit_staten$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Staten Island") %>%
layout(title = "Distribution density of violation scores in 5 Boroughs - 2019",
xaxis = list(title = 'Score', range = c(-1, 27)),
yaxis = list(title = 'Density'))
Interpretations
tmp <- df2019 %>%
dplyr::group_by(CAMIS, GRADE) %>%
dplyr::summarise(GRADE = unique(GRADE)) %>%
dplyr::ungroup()
tmp <- tmp[tmp$GRADE != "",]
tmp$CAMIS <- NULL
tmp <- data.frame(table(tmp) * 100 / sum(table(tmp)))
p <- plot_ly(x = tmp$tmp, y = tmp$Freq, type = "bar") %>%
layout(title = "Grade percentage",
xaxis = list(title="Grade"),
yaxis = list(title="Percentage", range=c(0,100), ticksuffix="%"),
bargap = 0.5)
p
Interpretations
tmp <- df2019 %>%
dplyr::group_by(CAMIS, GRADE) %>%
dplyr::summarise(GRADE = unique(GRADE), BOROUGH = unique(BOROUGH)) %>%
dplyr::ungroup()
tmp %>% filter(GRADE %in% c("A","B","C")) %>%
dplyr::count(BOROUGH,GRADE) %>%
ggplot(., aes(x=BOROUGH, y=n,fill=GRADE)) +
geom_bar(stat="identity", width=0.4, alpha=0.8, color="black") +
facet_grid(GRADE~., scale="free") +
geom_text(aes(label=n), vjust=1.2) +
labs(x="Borough", y="Counts", title = "Grade distribution by borough - 2019") +
theme_grey()
Interpretations
tmp %>%
filter(GRADE %in% c("A","B","C")) %>%
dplyr::group_by(BOROUGH, GRADE) %>%
dplyr::summarise(count = dplyr::n()) %>%
dplyr::mutate(Percent=count/sum(count)*100) %>%
dplyr::ungroup() %>%
ggplot(., aes(x=BOROUGH, y=Percent, fill=GRADE)) +
geom_bar(aes(group=BOROUGH), stat="identity", width=0.4, alpha=0.8, color="black") +
labs(x="", y="Percentage") + coord_flip() + theme_bw()
Interpretations
topc <- cuisine[1:10,'CUISINE' ]
tmp <- df2019 %>%
dplyr::filter(GRADE %in% c("A","B","C")) %>%
dplyr::filter(CUISINE %in% topc) %>%
dplyr::group_by(CUISINE, GRADE) %>%
dplyr::summarise(Count=dplyr::n()) %>%
dplyr::mutate(Percent=Count/sum(Count)*100) %>%
dplyr::ungroup()
p <- plot_ly() %>%
add_pie(data = filter(tmp, CUISINE == "American"), labels = ~GRADE, values = ~Percent,
name = "American", domain = list(row = 0, column = 0),
marker = list(colors=c("green", "orange", "red")), title = "<b>American</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Chinese"), labels = ~GRADE, values = ~Percent,
name = "Chinese", domain = list(row = 0, column = 1),
marker = list(colors=c("green", "orange", "red")), title = "<b>Chinese</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Cafe"), labels = ~GRADE, values = ~Percent,
name = "Cafe", domain = list(row = 0, column = 2),
marker = list(colors=c("green", "orange", "red")), title = "<b>Cafe</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Pizza"), labels = ~GRADE, values = ~Percent,
name = "Pizza", domain = list(row = 0, column = 3),
marker = list(colors=c("green", "orange", "red")), title = "<b>Pizza</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Mexican"), labels = ~GRADE, values = ~Percent,
name = "Mexican", domain = list(row = 0, column = 4),
marker = list(colors=c("green", "orange", "red")), title = "<b>Mexican</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Italian"), labels = ~GRADE, values = ~Percent,
name = "Italian", domain = list(row = 1, column = 0),
marker = list(colors=c("green", "orange", "red")), title = "<b>Italian</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Japanese"), labels = ~GRADE, values = ~Percent,
name = "Japanese", domain = list(row = 1, column = 1),
marker = list(colors=c("green", "orange", "red")), title = "<b>Japanese</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Latin"), labels = ~GRADE, values = ~Percent,
name = "Latin", domain = list(row = 1, column = 2),
marker = list(colors=c("green", "orange", "red")), title = "<b>Latin</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Bakery"), labels = ~GRADE, values = ~Percent,
name = "Bakery", domain = list(row = 1, column = 3),
marker = list(colors=c("green", "orange", "red")), title = "<b>Bakery</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Caribbean"), labels = ~GRADE, values = ~Percent,
name = "Caribbean", domain = list(row = 1, column = 4),
marker = list(colors=c("green", "orange", "red")), title = "<b>Caribbean</b>") %>%
layout(title = "Grade distribution among top 10 restaurants", showlegend = T,
grid=list(rows=2, columns=5),
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
invisible(dplyr::ungroup(df2019))
Interpretations
tmp <- df2019 %>%
dplyr::filter(CRITICAL.FLAG %in% c("Y", "N")) %>%
dplyr::filter(CUISINE %in% topc) %>%
dplyr::group_by(CUISINE, CRITICAL.FLAG) %>%
dplyr::summarise(Count=dplyr::n()) %>%
dplyr::mutate(Percent=Count/sum(Count)*100) %>%
dplyr::ungroup()
p <- plot_ly() %>%
add_pie(data = filter(tmp, CUISINE == "American"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "American", domain = list(row = 0, column = 0),
marker = list(colors=c("green", "red")), title = "<b>American</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Chinese"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Chinese", domain = list(row = 0, column = 1),
marker = list(colors=c("green", "red")), title = "<b>Chinese</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Cafe"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Cafe", domain = list(row = 0, column = 2),
marker = list(colors=c("green", "red")), title = "<b>Cafe</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Pizza"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Pizza", domain = list(row = 0, column = 3),
marker = list(colors=c("green", "red")), title = "<b>Pizza</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Mexican"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Mexican", domain = list(row = 0, column = 4),
marker = list(colors=c("green", "red")), title = "<b>Mexican</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Italian"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Italian", domain = list(row = 1, column = 0),
marker = list(colors=c("green", "red")), title = "<b>Italian</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Japanese"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Japanese", domain = list(row = 1, column = 1),
marker = list(colors=c("green", "red")), title = "<b>Japanese</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Latin"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Latin", domain = list(row = 1, column = 2),
marker = list(colors=c("green", "red")), title = "<b>Latin</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Bakery"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Bakery", domain = list(row = 1, column = 3),
marker = list(colors=c("green", "red")), title = "<b>Bakery</b>") %>%
add_pie(data = filter(tmp, CUISINE == "Caribbean"), labels = ~CRITICAL.FLAG, values = ~Percent,
name = "Caribbean", domain = list(row = 1, column = 4),
marker = list(colors=c("green", "red")), title = "<b>Caribbean</b>") %>%
layout(title = "Critical flag distribution among top 10 restaurants", showlegend = T,
grid=list(rows=2, columns=5),
xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
p
Interpretations
tmp <- df2019 %>%
dplyr::group_by(CAMIS, GRADE) %>%
dplyr::summarise(GRADE = unique(GRADE), SCORE = unique(SCORE)) %>%
dplyr::ungroup()
tmp <- tmp[tmp$SCORE >= 0 & tmp$GRADE %in% c('A', 'B', 'C'), ]
grade <- tmp$GRADE
fact <- as.factor(grade)
fact <- factor(fact, levels = c("C", "B", "A"))
fig <- plot_ly(tmp,
y=tmp$SCORE,
color = fact,
type = "box",
boxmean = "sd",
colors = c("Red","Orange","green")) %>%
layout(title = "Score distribution by grades",
xaxis = list(title = "Grade",
zeroline = FALSE),
yaxis = list(title = "Violation Score",
zeroline = FALSE))
fig
Interpretations
tmp_top10 <- df2019 %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(CUISINE = unique(CUISINE), SCORE = unique(SCORE)) %>%
dplyr::ungroup()
top <- cuisine$CUISINE[1:10]
tmp_top10 <- tmp_top10[tmp_top10$CUISINE %in% top,]
tmp_top10 <- na.omit(tmp_top10)
subplot(
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "American" & SCORE <= 30),
name = "American") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Chinese" & SCORE <= 30),
name = "Chinese") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Cafe" & SCORE <= 30),
name = "Cafe") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Pizza" & SCORE <= 30),
name = "Pizza") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Mexican" & SCORE <= 30),
name = "Mexican") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider()
)
subplot(
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Italian" & SCORE <= 30),
name = "Italian") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Japanese" & SCORE <= 30),
name = "Japanese") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Latin" & SCORE <= 30),
name = "Latin") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Bakery" & SCORE <= 30),
name = "Bakery") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Caribbean" & SCORE <= 30),
name = "Caribbean") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider()
)
Interpretations
v <- df2019$VIOLATION.CODE
v <- v[!is.na(v)]
v <- v[!(v == "")]
violation <- df2019 %>%
dplyr::group_by(VIOLATION.CODE) %>%
dplyr::summarise(DESC = unique(VIOLATION.DESCRIPTION)) %>%
dplyr::ungroup()
tmp <- data.frame(table(v) * 100 / sum(table(v)))
tmp <- tmp[order(tmp$Freq, decreasing = TRUE),]
topv <- as.character(tmp$v[1:10])
p <- plot_ly(x = tmp$v, y = tmp$Freq, type = "bar") %>%
layout(title = "Violation codes as %age of total violations",
xaxis = list(title="Violation Code", categoryorder = "array",
categoryarray = tmp$Freq),
yaxis = list(title="Percentage", ticksuffix="%")) %>%
rangeslider()
p
Interpretations
itype <- df2019 %>%
dplyr::group_by(INSPECTION.TYPE) %>%
dplyr::summarise(n = dplyr::n()) %>%
dplyr::ungroup()
itype$n <- (itype$n * 100)/sum(itype$n)
itype <- itype[order(itype$n, decreasing = TRUE),]
itype <- itype[1:5,]
p <- plot_ly(x = itype$INSPECTION.TYPE, y = itype$n, type = "bar") %>%
layout(title = "Top 5 Inspection type",
xaxis = list(title="Inspection Types", categoryorder = "array",
categoryarray = tmp$Freq),
yaxis = list(title="Percentage", ticksuffix="%"))
p
Interpretations
df2019 %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(SCORE = unique(SCORE)) %>%
dplyr::ungroup() %>%
plot_ly(x = ~SCORE, type = "histogram") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count")) %>%
rangeslider()
Below are the stats for the scores for year 2019
tmp <- df2019 %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(SCORE = unique(SCORE)) %>%
dplyr::ungroup()
paste("Max value of score:", max(tmp$SCORE, na.rm = TRUE))
paste("\nMin value of score:", min(tmp$SCORE, na.rm = TRUE))
paste("\nMean value of score:", mean(tmp$SCORE, na.rm = TRUE))
paste("\nStandard Deviation value of score:", sd(tmp$SCORE, na.rm = TRUE))
## [1] "Max value of score: 164"
## [1] "\nMin value of score: -1"
## [1] "\nMean value of score: 11.2379941739492"
## [1] "\nStandard Deviation value of score: 7.7372793777009"
set.seed(3333)
tmp <- na.omit(tmp)
samples <- 1000
sample.size <- 10
xbar1 <- numeric(samples)
for (i in 1:samples){
xbar1[i] <- mean(sample(tmp$SCORE, sample.size, replace = TRUE))
}
cat("\n")
cat("For 1000 random samples of size ", sample.size)
cat("\nMean value of score: ", mean(xbar1))
cat("\nStandard Deviation value of score: ", sd(xbar1))
sample.size <- 20
xbar2 <- numeric(samples)
for (i in 1:samples){
xbar2[i] <- mean(sample(tmp$SCORE, sample.size, replace = TRUE))
}
cat("\n")
cat("For 1000 random samples of size ", sample.size)
cat("\nMean value of score: ", mean(xbar2))
cat("\nStandard Deviation value of score: ", sd(xbar2))
sample.size <- 30
xbar3 <- numeric(samples)
for (i in 1:samples){
xbar3[i] <- mean(sample(tmp$SCORE, sample.size, replace = TRUE))
}
cat("\n")
cat("For 1000 random samples of size ", sample.size)
cat("\nMean value of score: ", mean(xbar3))
cat("\nStandard Deviation value of score: ", sd(xbar3))
sample.size <- 40
xbar4 <- numeric(samples)
for (i in 1:samples){
xbar4[i] <- mean(sample(tmp$SCORE, sample.size, replace = TRUE))
}
cat("\n")
cat("For 1000 random samples of size ", sample.size)
cat("\nMean value of score: ", mean(xbar4))
cat("\nStandard Deviation value of score: ", sd(xbar4))
fit_10 <- density(xbar1)
fit_20 <- density(xbar2)
fit_30 <- density(xbar3)
fit_40 <- density(xbar4)
##
## For 1000 random samples of size 10
## Mean value of score: 11.2592
## Standard Deviation value of score: 2.502838
## For 1000 random samples of size 20
## Mean value of score: 11.1878
## Standard Deviation value of score: 1.736039
## For 1000 random samples of size 30
## Mean value of score: 11.2289
## Standard Deviation value of score: 1.401399
## For 1000 random samples of size 40
## Mean value of score: 11.25452
## Standard Deviation value of score: 1.170956
plot_ly(x = ~fit_10$x, y = ~fit_10$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Sample size: 10") %>%
add_trace(x = ~fit_20$x, y = ~fit_20$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Sample size: 20") %>%
add_trace(x = ~fit_30$x, y = ~fit_30$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Sample size: 30") %>%
add_trace(x = ~fit_40$x, y = ~fit_40$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Sample size: 40") %>%
layout(title = "Distribution density of various sample sizes for violation score",
xaxis = list(title = 'Score'),
yaxis = list(title = 'Density'))
Interpretations
We have already covered this above in Central Limit Theorem
Let us take 1000 samples of size 50 for Equal probabilities sampling techniques
scores <- tmp$SCORE
scores <- scores[scores >= 0]
samples <- 1000
N <- length(scores)
n <- 50
x_scores <- numeric(samples)
for (i in 1:samples){
k <- ceiling(N/n)
r <- sample(scores, 1)
xbar <- seq(r, by = k, length = n)
for (j in xbar){
x_scores[i] <- mean(scores[xbar])
}
}
cat("\n")
cat("For 1000 random samples of size ", n)
cat("\nMean value of score: ", mean(x_scores))
cat("\nStandard Deviation value of score: ", sd(x_scores))
fit <- density(x_scores)
##
## For 1000 random samples of size 50
## Mean value of score: 11.1603
## Standard Deviation value of score: 0.9267514
plot_ly(x = ~fit$x, y = ~fit$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Sample size: 50") %>%
layout(title = "Equal Probability: Distribution density of sample size 50 for violation score",
xaxis = list(title = 'Score'),
yaxis = list(title = 'Density'))
Interpretations
Let us take 1000 samples of size 50 for Equal probabilities sampling techniques
scores_nonzero <- scores[scores > 0]
samples <- 1000
x_scores <- numeric(samples)
for (i in 1:samples){
pik <- inclusionprobabilities(scores_nonzero, 50)
s <- UPsystematic(pik)
sample <- scores_nonzero[s!=0]
x_scores[i] <- mean(sample)
}
cat("\n")
cat("For 1000 random samples of size ", n)
cat("\nMean value of score: ", mean(x_scores))
cat("\nStandard Deviation value of score: ", sd(x_scores))
fit <- density(x_scores)
##
## For 1000 random samples of size 50
## Mean value of score: 16.53038
## Standard Deviation value of score: 1.835034
plot_ly(x = ~fit$x, y = ~fit$y, type = 'scatter', mode = 'lines',
fill = 'tozeroy', name = "Sample size: 50") %>%
layout(title = "Unequal Probability: Distribution density of sample size 50 for violation score",
xaxis = list(title = 'Score'),
yaxis = list(title = 'Density'))
Interpretations
Stratified sampling on scores based on top 10 cuisine Following is the freq table for the top 10 restaurants
tmp <- df2019 %>%
dplyr::filter(CUISINE %in% topc) %>%
dplyr::group_by(CAMIS) %>%
dplyr::summarise(CUISINE = unique(CUISINE), SCORE = unique(SCORE)) %>%
dplyr::ungroup()
tmp$CAMIS <- NULL
tmp <- tmp[tmp$SCORE >= 0,]
tmp <- na.omit(tmp)
table(tmp$CUISINE)
##
## American Bakery Cafe Caribbean Chinese Italian Japanese Latin
## 5401 732 1712 653 2212 906 806 778
## Mexican Pizza
## 915 1117
Let us show an example of how to get 50 samples by stratified sampling on our data
n <- 50
freq <- table(tmp$CUISINE)
st.sizes <- n * freq / sum(freq)
st <- strata(tmp, stratanames = c("CUISINE"), size = st.sizes, method = "srswor",
description = TRUE)
## Stratum 1
##
## Population total and number of selected units: 5401 17.72912
## Stratum 2
##
## Population total and number of selected units: 2212 2.402836
## Stratum 3
##
## Population total and number of selected units: 653 5.619748
## Stratum 4
##
## Population total and number of selected units: 1117 2.143514
## Stratum 5
##
## Population total and number of selected units: 778 7.261029
## Stratum 6
##
## Population total and number of selected units: 906 2.974002
## Stratum 7
##
## Population total and number of selected units: 1712 2.645746
## Stratum 8
##
## Population total and number of selected units: 915 2.553834
## Stratum 9
##
## Population total and number of selected units: 806 3.003545
## Stratum 10
##
## Population total and number of selected units: 732 3.666623
## Number of strata 10
## Total number of selected units 50
We can check what samples were taken
st
## CUISINE ID_unit Prob Stratum
## 393 American 393 0.003282563 1
## 710 American 710 0.003282563 1
## 1304 American 1304 0.003282563 1
## 2123 American 2123 0.003282563 1
## 2788 American 2788 0.003282563 1
## 4184 American 4184 0.003282563 1
## 4410 American 4410 0.003282563 1
## 4704 American 4704 0.003282563 1
## 4758 American 4758 0.003282563 1
## 6054 American 6054 0.003282563 1
## 7539 American 7539 0.003282563 1
## 10772 American 10772 0.003282563 1
## 10788 American 10788 0.003282563 1
## 12548 American 12548 0.003282563 1
## 14018 American 14018 0.003282563 1
## 14787 American 14787 0.003282563 1
## 15100 American 15100 0.003282563 1
## 1494 Chinese 1494 0.001086273 2
## 8377 Chinese 8377 0.001086273 2
## 3232 Caribbean 3232 0.008606046 3
## 4399 Caribbean 4399 0.008606046 3
## 8328 Caribbean 8328 0.008606046 3
## 10691 Caribbean 10691 0.008606046 3
## 13112 Caribbean 13112 0.008606046 3
## 2255 Pizza 2255 0.001918992 4
## 8028 Pizza 8028 0.001918992 4
## 1843 Latin 1843 0.009332943 5
## 1934 Latin 1934 0.009332943 5
## 2451 Latin 2451 0.009332943 5
## 2778 Latin 2778 0.009332943 5
## 8863 Latin 8863 0.009332943 5
## 9325 Latin 9325 0.009332943 5
## 11578 Latin 11578 0.009332943 5
## 1070 Italian 1070 0.003282563 6
## 2840 Italian 2840 0.003282563 6
## 5119 Cafe 5119 0.001545412 7
## 5693 Cafe 5693 0.001545412 7
## 8144 Mexican 8144 0.002791075 8
## 14700 Mexican 14700 0.002791075 8
## 3315 Japanese 3315 0.003726483 9
## 10853 Japanese 10853 0.003726483 9
## 12069 Japanese 12069 0.003726483 9
## 1237 Bakery 1237 0.005009048 10
## 1263 Bakery 1263 0.005009048 10
## 6527 Bakery 6527 0.005009048 10
Let us repeat this step 1000 times and get the distribution
sc <- tmp[st$ID_unit,'SCORE']
dfs <- as.data.frame(st$CUISINE)
dfs$SCORE <- sc$SCORE
dfs <- dplyr::rename(dfs,"CUISINE" = "st$CUISINE")
for (i in 1:999){
st <- strata(tmp, stratanames = c("CUISINE"), size = st.sizes, method = "srswor",
description = FALSE)
sc <- tmp[st$ID_unit,'SCORE']
tmp_dfs <- as.data.frame(st$CUISINE)
tmp_dfs$SCORE <- sc$SCORE
tmp_dfs <- dplyr::rename(tmp_dfs,"CUISINE" = "st$CUISINE")
dfs <- rbind(dfs, tmp_dfs)
}
table(dfs$CUISINE)
##
## American Bakery Cafe Caribbean Chinese Italian Japanese Latin
## 17000 3000 2000 5000 2000 2000 3000 7000
## Mexican Pizza
## 2000 2000
We can now plot the score distribution. Let us compare with the top 5
subplot(
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "American" & SCORE <= 30),
name = "American") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Chinese" & SCORE <= 30),
name = "Chinese") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Cafe" & SCORE <= 30),
name = "Cafe") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Pizza" & SCORE <= 30),
name = "Pizza") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(tmp_top10, CUISINE == "Mexican" & SCORE <= 30),
name = "Mexican") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider()
)
subplot(
plot_ly(x = ~SCORE, type = "histogram", data = filter(dfs, CUISINE == "American" & SCORE <= 30),
name = "American") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(dfs, CUISINE == "Chinese" & SCORE <= 30),
name = "Chinese") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(dfs, CUISINE == "Cafe" & SCORE <= 30),
name = "Cafe") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(dfs, CUISINE == "Pizza" & SCORE <= 30),
name = "Pizza") %>%
layout(title = "Distribution of violation scores - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider(),
plot_ly(x = ~SCORE, type = "histogram", data = filter(dfs, CUISINE == "Mexican" & SCORE <= 30),
name = "Mexican") %>%
layout(title = "Distribution of violation scores via Stratified Sampling - 2019",
xaxis = list(title = "Score", dtick = 2, tickmode = "linear"),
yaxis = list(title = "Count"),
bargap = 0.3) %>%
rangeslider()
)
Lets check the stats
for (i in 1:10) {
pop <- tmp_top10[tmp_top10$CUISINE == topc[i], c('SCORE')]
smp <- dfs[dfs$CUISINE == topc[i], c('SCORE')]
cat("\n\n", topc[i])
cat("\nMean of population:", mean(pop$SCORE))
cat("\nSD of population:", sd(pop$SCORE))
cat("\nMean of samples:", mean(smp))
cat("\nSD of samples:", sd(smp))
}
##
##
## American
## Mean of population: 10.72927
## SD of population: 6.841898
## Mean of samples: 10.75671
## SD of samples: 6.833053
##
## Chinese
## Mean of population: 12.02167
## SD of population: 7.854681
## Mean of samples: 11.9575
## SD of samples: 7.658199
##
## Cafe
## Mean of population: 9.183197
## SD of population: 5.989311
## Mean of samples: 9.3995
## SD of samples: 6.279141
##
## Pizza
## Mean of population: 10.82931
## SD of population: 7.127914
## Mean of samples: 10.883
## SD of samples: 7.409302
##
## Mexican
## Mean of population: 11.56707
## SD of population: 7.638183
## Mean of samples: 11.748
## SD of samples: 7.550629
##
## Italian
## Mean of population: 11.24917
## SD of population: 7.00171
## Mean of samples: 11.448
## SD of samples: 6.803708
##
## Japanese
## Mean of population: 11.96283
## SD of population: 7.484549
## Mean of samples: 11.83767
## SD of samples: 7.293448
##
## Latin
## Mean of population: 12.61361
## SD of population: 7.945241
## Mean of samples: 12.60343
## SD of samples: 7.711426
##
## Bakery
## Mean of population: 11.15143
## SD of population: 8.254837
## Mean of samples: 10.86433
## SD of samples: 7.899371
##
## Caribbean
## Mean of population: 13.50459
## SD of population: 11.01158
## Mean of samples: 13.5194
## SD of samples: 11.1202
Looking at the below graph (barring 2020), we can see that the number of restaurants receiving grade A is increasing which presents a good outlook for the people who take resturant eating into consideration
tmp <- inspnew %>%
dplyr::filter(GRADE %in% c("A","B","C")) %>%
dplyr::group_by(YEAR, GRADE) %>%
dplyr::summarise(count = dplyr::n()) %>%
dplyr::ungroup()
tmp$CAMIS <- NULL
A <- tmp[tmp$GRADE == "A", c("YEAR","count")]
B <- tmp[tmp$GRADE == "B", c("YEAR","count")]
C <- tmp[tmp$GRADE == "C", c("YEAR","count")]
plot_ly(x = A$YEAR, y=A$count, type="scatter", mode = 'lines+markers', name = "A", color = "green") %>%
add_trace(x = B$YEAR, y=B$count, type="scatter", mode = 'lines+markers', name = "B", color = "orange") %>%
add_trace(x = C$YEAR, y=C$count, type="scatter", mode = 'lines+markers', name = "C", color = "red") %>%
layout(title = "Grade change over the years",
xaxis = list(title = "Year"),
yaxis = list(title = "Number of restaurants"))